How to go public with private data

In my current work with Stephen Jenkins, we are starting the submission process and have run into a problem: many journals require a replication package along with the paper.

Providing the code for the model estimations that we created, and even including code developed by other authors, is relatively simple. The problem, which many people may face, is how to distribute data that, for privacy or proprietary reasons, we are not allowed to share.

As a matter of fact, on this particular project, only Stephen has seen the data, whereas I have worked from afar, primarily on the code that estimates the new models (if you are interested in the research, I provide references to previous papers we have worked on below).

So now that it is time for the "big" paper to be published, we need a strategy to construct a synthetic dataset that satisfies all privacy protection constraints, while still preserving the moment structure that we care about, as well as moments we may not be so interested in but that may be of interest to other people.

With this idea in mind, I came up with a simple strategy that may help to do just that: an application of multiple imputation. This may not be THE best method to do this, so I'll be happy to hear any comments.

To better describe how the method works, I will use a dataset that is readily available online. This data is an excerpt of the Swiss Labor Market Survey 1998, which is provided as the example dataset for the command -oaxaca- (Jann 2008).

The problem

Assume that you have signed a confidentiality agreement to work with the Swiss survey data. You are ready to submit your work, but you are asked to provide a replication package, with the code that produces your tables and the dataset itself.

Since you cannot share the data, you propose instead to provide 5 synthetic datasets, so people can apply your code and reach similar (if not the same) conclusions as in your main paper. Here is a piece of code you could use for that.
. ** Assumption. You have a dataset that you want to use
. clear all

. use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)

. misstable summarize
                                                               Obs<.
                                                +------------------------------
               |                                | Unique
      Variable |     Obs=.     Obs>.     Obs<.  | values        Min         Max
  -------------+--------------------------------+------------------------------
        lnwage |       213               1,434  |   >500    .507681    5.259097
         exper |       213               1,434  |   >500          0    49.16667
        tenure |       213               1,434  |    323          0    44.83333
          isco |       213               1,434  |      9          1           9
  -----------------------------------------------------------------------------
Four variables have missing data: wages, experience, tenure, and ISCO. They are missing when LFP=0.
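The claim about the missing-data pattern can be checked directly. This is a minimal sketch, not part of the original workflow: for each of the four variables it counts the observations where missingness and lfp==0 disagree. If the pattern is as described, every count is zero.

```stata
* Sketch: check that the four variables are missing exactly when lfp==0
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
foreach v of varlist lnwage exper tenure isco {
    * rows where "v is missing" and "lfp==0" do not coincide
    quietly count if missing(`v') != (lfp == 0)
    display "`v': " r(N) " observations break the pattern"
}
tabulate lfp, missing
```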

Now, I suggest creating two variables: an ID variable, and a "seed" variable that will be used to generate the synthetic datasets. The seed is just a random uniform variable that ranges from 0 to 100.
. gen id = _n

. set seed 10101

. gen seed = runiform(0,100)
The next step is to decide how large the synthetic dataset will be. The obvious choice is the same number of observations as the original (1,647), but this can be adjusted if you want other sample sizes. So I'll expand the dataset by adding 1,647 copies of observation 1. The gen(tag) option tags the new copies with tag=1 (the original observations have tag=0):
. expand 1648 in 1, gen(tag)
(1,647 observations created)
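If you want a different number of synthetic observations, the count passed to expand can be computed instead of typed in. The sketch below assumes a hypothetical local macro, nsynth, that holds the desired size. Because expand counts the original row as one of the copies, the number passed is nsynth + 1.

```stata
* Sketch: choose the number of synthetic rows with a local macro
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
local nsynth = _N                      // hypothetical name; here the original sample size
local copies = `nsynth' + 1            // expand counts the original row too
expand `copies' in 1, generate(tag)    // tag==1 marks the newly created rows
count if tag == 1
```

Setting nsynth to, say, 5000 would give a larger synthetic block without touching the rest of the code.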
You can now set all variables to missing for the new observations (tag==1):
. foreach i of varlist lnwage educ exper tenure isco female lfp age single married divorced kids6 kids714 wt {
  2.         replace `i'=. if tag==1
  3. }
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
(1,647 real changes made, 1,647 to missing)
You will also need to draw new values of the "seed" variable for these observations:
. replace seed = runiform(0,100) if tag==1
(1,647 real changes made)
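We also need to assign LFP in the synthetic observations, because the missing-data pattern depends on labor force status. Here I draw it as a random binary variable with probability 0.87, roughly the share of observations with LFP=1 in the original data (1,434 of 1,647):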
. replace lfp = runiform()<.87 if tag==1
(1,647 real changes made)
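Rather than hard-coding 0.87, the participation rate could be read from the data before the synthetic rows are filled in. The following is only a sketch of that variation. The local macro plfp is a hypothetical name, and the random draws will differ from those in the log above because the order of the random-number calls is different.

```stata
* Sketch: draw synthetic lfp using the observed participation rate
use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
set seed 10101
summarize lfp, meanonly
local plfp = r(mean)                           // observed share with lfp==1
expand 1648 in 1, generate(tag)                // 1,647 new rows, tag==1
replace lfp = runiform() < `plfp' if tag == 1  // Bernoulli draw for the new rows
tabulate lfp tag, column
```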
The next step is to create the multiply imputed datasets. I believe the best strategy here is to use predictive mean matching ("pmm"), because it draws imputed values from the observed values of each variable, so the observed distribution and data types are preserved. So first mi set the data, and register all variables to be imputed:
. mi set wide

. mi register impute lnwage educ exper tenure   isco female age single married   kids6 kids714 wt
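To see why pmm keeps the data types intact, here is a small side example on Stata's built-in auto dataset (not on the survey data). It is only an illustration: rep78 takes the values 1 to 5 and has a few missing values, and pmm fills them by borrowing observed values from nearby donors, so the imputed values are also integers between 1 and 5. The choices knn(5) and add(1) are arbitrary.

```stata
* Sketch: pmm imputes with values taken from observed donors
sysuse auto, clear
set seed 12345
mi set wide
mi register imputed rep78
mi impute pmm rep78 mpg weight, knn(5) add(1)
* in wide style the first imputation is stored in _1_rep78
assert inlist(_1_rep78, 1, 2, 3, 4, 5)
tabulate _1_rep78 if missing(rep78)
```

A regression-based imputation would instead produce fractional values for a variable like this one.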
Then simply impute all variables using chained pmm. Just make sure that none of the variables are collinear (here collinearity exists among single, married, and divorced, which is why divorced is left out), and that variables with structural missing data are specified separately (here with the condition if lfp==1).

Notice as well that the explanatory variables are "seed" (fully random) and LFP (also random).
. mi impute chain (pmm, knn(100))    educ   female   age single married kids6 kids714 wt (pmm if lfp==1, knn(100) ) lnwage  exper tenure isco  = seed lfp, add(5)
note: missing-value pattern is monotone; no iteration performed

Conditional models (monotone):
              educ: pmm educ seed lfp , knn(100)
            female: pmm female educ seed lfp , knn(100)
               age: pmm age female educ seed lfp , knn(100)
            single: pmm single age female educ seed lfp , knn(100)
           married: pmm married single age female educ seed lfp , knn(100)
             kids6: pmm kids6 married single age female educ seed lfp , knn(100)
           kids714: pmm kids714 kids6 married single age female educ seed lfp , knn(100)
                wt: pmm wt kids714 kids6 married single age female educ seed lfp , knn(100)
            lnwage: pmm lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
             exper: pmm exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
            tenure: pmm tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)
              isco: pmm isco tenure exper lnwage wt kids714 kids6 married single age female educ seed lfp if lfp==1, knn(100)

Performing chained iterations ...

Multivariate imputation                     Imputations =        5
Chained equations                                 added =        5
Imputed: m=1 through m=5                        updated =        0

Initialization: monotone                     Iterations =        0
                                                burn-in =        0

              educ: predictive mean matching
            female: predictive mean matching
               age: predictive mean matching
            single: predictive mean matching
           married: predictive mean matching
             kids6: predictive mean matching
           kids714: predictive mean matching
                wt: predictive mean matching
            lnwage: predictive mean matching
             exper: predictive mean matching
            tenure: predictive mean matching
              isco: predictive mean matching

------------------------------------------------------------------
                   |               Observations per m             
                   |----------------------------------------------
          Variable |   Complete   Incomplete   Imputed |     Total
-------------------+-----------------------------------+----------
              educ |       1647         1647      1647 |      3294
            female |       1647         1647      1647 |      3294
               age |       1647         1647      1647 |      3294
            single |       1647         1647      1647 |      3294
           married |       1647         1647      1647 |      3294
             kids6 |       1647         1647      1647 |      3294
           kids714 |       1647         1647      1647 |      3294
                wt |       1647         1647      1647 |      3294
            lnwage |       1434         1458      1458 |      2892
             exper |       1434         1458      1458 |      2892
            tenure |       1434         1458      1458 |      2892
              isco |       1434         1458      1458 |      2892
------------------------------------------------------------------
(complete + incomplete = total; imputed is the minimum across m
 of the number of filled-in observations.)
That's it. You now have 5 sets of variables that can be used to create distinct synthetic datasets, with a structure similar to that of the original confidential dataset, which can be used for replication and public use.
. forvalues i = 1/5 {
  2.         preserve
  3.                 keep if tag==1
  4.                 keep _`i'_* lfp 
  5.                 ren _`i'_* *
  6.                 save fake_oaxaca_`i', replace
  7.         restore
  8. }
(1,647 observations deleted)
(note: file fake_oaxaca_1.dta not found)
file fake_oaxaca_1.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_2.dta not found)
file fake_oaxaca_2.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_3.dta not found)
file fake_oaxaca_3.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_4.dta not found)
file fake_oaxaca_4.dta saved
(1,647 observations deleted)
(note: file fake_oaxaca_5.dta not found)
file fake_oaxaca_5.dta saved
Now let's see if this works, by estimating a simple linear regression, a conditional quantile regression (10th percentile), and a Heckman selection model.
. frame create test

. 
. frame test: {
.         use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
.         qui:reg lnwage educ exper tenure female
.         est sto m1
.         qui:qreg lnwage educ exper tenure female, q(10)
.         est sto m2
.         qui:heckman lnwage educ exper tenure female age, selec(lfp =educ     female age single married kids6 kids714) two
.         est sto m3
. }

. 
. forvalues i = 1/5 { 
  2.         frame test: {
  3.                 use fake_oaxaca_`i', clear
  4.                 
.                 qui:reg lnwage educ exper tenure female
  5.                 est sto m1`i'
  6.                 qui:qreg lnwage educ exper tenure female, q(10)
  7.                 est sto m2`i'
  8.  
.                 qui:heckman lnwage educ exper tenure female age, selec(lfp =educ     female age single married kids6 kids714) two
  9.         est sto m3`i'
 10.         } 
 11. }
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)
(Excerpt from the Swiss Labor Market Survey 1998)

. ** OLS
. esttab m1 m11 m12 m13 m14 m15, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)

------------------------------------------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)             (5)             (6)   
                 Original           Fake1           Fake2           Fake3           Fake4           Fake5   
------------------------------------------------------------------------------------------------------------
educ               0.0848***       0.0709***       0.0643***       0.0727***       0.0640***       0.0588***
                  (16.34)         (14.21)         (12.36)         (14.67)         (13.37)         (12.32)   

exper              0.0111***      0.00913***      0.00950***      0.00995***      0.00752***       0.0110***
                   (7.22)          (6.88)          (6.29)          (6.76)          (5.16)          (7.22)   

tenure            0.00771***      0.00570***      0.00718***      0.00644***      0.00670***      0.00540** 
                   (4.10)          (3.41)          (3.70)          (3.57)          (3.75)          (2.79)   

female            -0.0841***      -0.0398         -0.1000***      -0.0711**        -0.111***      -0.0767** 
                  (-3.35)         (-1.75)         (-3.97)         (-2.85)         (-4.60)         (-3.07)   

_cons               2.213***        2.428***        2.490***        2.373***        2.542***        2.557***
                  (32.38)         (37.91)         (36.03)         (35.83)         (39.59)         (39.51)   
------------------------------------------------------------------------------------------------------------
N                    1434            1458            1458            1458            1458            1458   
------------------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. ** qreg 10
. esttab m2 m21 m22 m23 m24 m25, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)

------------------------------------------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)             (5)             (6)   
                 Original           Fake1           Fake2           Fake3           Fake4           Fake5   
------------------------------------------------------------------------------------------------------------
educ                0.103***       0.0679***       0.0738***       0.0881***       0.0703***       0.0735***
                   (6.21)          (5.48)          (5.76)          (6.61)          (7.69)          (5.52)   

exper              0.0200***      0.00867**       0.00944*         0.0147***      0.00769**        0.0196***
                   (4.06)          (2.63)          (2.54)          (3.72)          (2.76)          (4.62)   

tenure           0.000669         0.00326          0.0108*        0.00419         0.00531       -0.000442   
                   (0.11)          (0.79)          (2.26)          (0.86)          (1.56)         (-0.08)   

female             -0.151       -0.000802         -0.0706         -0.0950         -0.0970*        -0.0831   
                  (-1.87)         (-0.01)         (-1.14)         (-1.41)         (-2.10)         (-1.19)   

_cons               1.462***        2.035***        1.869***        1.702***        2.028***        1.866***
                   (6.67)         (12.81)         (10.98)          (9.55)         (16.55)         (10.32)   
------------------------------------------------------------------------------------------------------------
N                    1434            1458            1458            1458            1458            1458   
------------------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001

. ** heckman
. esttab m3 m31 m32 m33 m34 m35, mtitle(Original Fake1 Fake2 Fake3 Fake4 Fake5)  

------------------------------------------------------------------------------------------------------------
                      (1)             (2)             (3)             (4)             (5)             (6)   
                 Original           Fake1           Fake2           Fake3           Fake4           Fake5   
------------------------------------------------------------------------------------------------------------
lnwage                                                                                                      
educ               0.0717***       0.0677***       0.0575***       0.0701***       0.0594***       0.0534***
                  (13.13)         (13.02)         (10.45)         (13.28)         (11.91)         (10.60)   

exper             0.00179         0.00264         0.00108         0.00271        -0.00226         0.00180   
                   (0.94)          (1.56)          (0.55)          (1.51)         (-1.28)          (0.94)   

tenure            0.00200         0.00199         0.00356         0.00184         0.00144         0.00130   
                   (1.01)          (1.14)          (1.79)          (0.98)          (0.79)          (0.66)   

female             -0.105***      -0.0979**        -0.154***       -0.170***       -0.204***       -0.143***
                  (-3.59)         (-3.01)         (-4.42)         (-5.30)         (-6.75)         (-4.67)   

age                0.0146***      0.00946***       0.0113***       0.0104***       0.0140***       0.0122***
                   (7.92)          (5.90)          (6.56)          (6.28)          (8.58)          (7.16)   

_cons               1.991***        2.220***        2.281***        2.134***        2.230***        2.309***
                  (27.12)         (30.89)         (30.16)         (28.82)         (31.24)         (32.32)   
------------------------------------------------------------------------------------------------------------
lfp                                                                                                         
educ                0.149***        0.133***        0.156***        0.170***        0.141***        0.160***
                   (5.37)          (4.93)          (6.11)          (6.70)          (5.59)          (6.19)   

female             -1.785***       -1.592***       -1.745***       -1.480***       -1.607***       -1.502***
                 (-11.09)        (-10.11)        (-10.37)         (-9.87)        (-10.26)         (-9.71)   

age               -0.0388***      -0.0214***     -0.00603         -0.0270***      -0.0268***      -0.0252***
                  (-5.77)         (-4.02)         (-1.07)         (-4.76)         (-4.64)         (-4.34)   

single            -0.0998          -0.583**        0.0134          -0.379          -0.483*         -0.159   
                  (-0.43)         (-3.00)          (0.07)         (-1.91)         (-2.37)         (-0.74)   

married            -0.867***       -0.698***       -0.667***       -0.605***       -0.873***       -0.705***
                  (-5.48)         (-4.14)         (-3.97)         (-3.89)         (-5.13)         (-4.48)   

kids6              -0.716***       -0.578***       -0.399***       -0.689***       -0.584***       -0.588***
                  (-8.71)         (-7.56)         (-4.98)         (-9.20)         (-7.70)         (-7.79)   

kids714            -0.343***       -0.131*         -0.177**        -0.282***       -0.206**        -0.207** 
                  (-5.26)         (-2.03)         (-2.83)         (-4.24)         (-3.19)         (-3.19)   

_cons               3.543***        2.645***        1.701***        2.438***        2.941***        2.450***
                   (7.29)          (6.14)          (4.11)          (5.94)          (6.73)          (5.66)   
------------------------------------------------------------------------------------------------------------
/mills                                                                                                      
lambda             -0.123           0.118          0.0737           0.258**         0.218**         0.142   
                  (-1.88)          (1.43)          (0.83)          (3.24)          (2.94)          (1.86)   
------------------------------------------------------------------------------------------------------------
N                    1647            1647            1647            1647            1647            1647   
------------------------------------------------------------------------------------------------------------
t statistics in parentheses
* p<0.05, ** p<0.01, *** p<0.001
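One way to summarize the five synthetic columns in a table like the OLS one is to average the coefficients across datasets. The sketch below does only that. It is not a full multiple-imputation analysis (no combined standard errors), and it assumes that the files fake_oaxaca_1.dta to fake_oaxaca_5.dta saved earlier are in the working directory. The frame name avgcheck and the matrix names b, bsum, and bavg are hypothetical.

```stata
* Sketch: average the OLS coefficients over the five synthetic datasets
frame create avgcheck                  // scratch frame, so the data in memory is untouched
frame avgcheck {
    forvalues i = 1/5 {
        use fake_oaxaca_`i', clear
        quietly regress lnwage educ exper tenure female
        matrix b = e(b)
        if `i' == 1 matrix bsum = b    // start the running sum
        else        matrix bsum = bsum + b
    }
}
matrix bavg = bsum / 5                 // simple average across datasets
matrix list bavg
frame drop avgcheck
```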
What about means and covariances? Here I compare the original data with the first two synthetic datasets:
. frame test: {
.         use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
(Excerpt from the Swiss Labor Market Survey 1998)
.         mean lnwage exper tenure educ   female   age single married kids6 kids714 

Mean estimation                   Number of obs   =      1,434

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      lnwage |   3.357604   .0140235      3.330096    3.385113
       exper |   13.15324   .2632213       12.6369    13.66958
      tenure |   7.860937   .2144401      7.440287    8.281587
        educ |   11.53696   .0639585       11.4115    11.66242
      female |   .4762901   .0131934      .4504096    .5021706
         age |   38.83891   .2915321      38.26704    39.41079
      single |   .3891213   .0128794      .3638568    .4143859
     married |   .4700139   .0131845      .4441509     .495877
       kids6 |   .2182706   .0151344      .1885826    .2479586
     kids714 |   .2782427   .0172008      .2445013     .311984
--------------------------------------------------------------
. 
.         corr lnwage exper tenure educ   female   age single married kids6 kids714 , cov
(obs=1,434)

             |   lnwage    exper   tenure     educ   female      age   single  married    kids6  kids714
-------------+------------------------------------------------------------------------------------------
      lnwage |   .28201
       exper |  1.23107  99.3553
      tenure |  1.03799  47.0903  65.9418
        educ |  .469384 -3.24851 -.510834  5.86604
      female | -.043298 -.484036 -.598583  -.14532  .249612
         age |  2.05353  79.3047  54.7529  1.62913  .213554  121.877
      single | -.061535 -1.71735 -1.22853 -.005669 -.001235 -2.87447  .237872
     married |  .044889  1.05484  .938517  .089909 -.027229  1.75406  -.18302  .249275
       kids6 |  .030479 -.557053 -.447353  .118061 -.034249 -.953649 -.077317  .096222  .328459
     kids714 |  .036036  .088118  .006038  .020763  .001368  .469835 -.099274  .100813  .018081  .424272

. 
. }

. forvalues i = 1/2 { 
  2.         frame test: {
  3.         use fake_oaxaca_`i', clear
  4. mean lnwage exper tenure educ   female   age single married kids6 kids714 
  5. 
.         corr lnwage exper tenure educ   female   age single married kids6 kids714 , cov
  6.         } 
  7. }
(Excerpt from the Swiss Labor Market Survey 1998)

Mean estimation                   Number of obs   =      1,458

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      lnwage |   3.388626   .0123486      3.364403    3.412848
       exper |   13.41598   .2670916      12.89206    13.93991
      tenure |   7.820988   .2108657      7.407355     8.23462
        educ |   11.45045   .0597336      11.33327    11.56762
      female |   .4677641   .0130718      .4421225    .4934056
         age |   39.12826   .2939283      38.55169    39.70483
      single |   .3545953   .0125329      .3300108    .3791799
     married |   .5034294   .0130988      .4777349    .5291238
       kids6 |   .1920439   .0139977      .1645861    .2195017
     kids714 |   .3155007   .0185486      .2791158    .3518855
--------------------------------------------------------------
(obs=1,458)

             |   lnwage    exper   tenure     educ   female      age   single  married    kids6  kids714
-------------+------------------------------------------------------------------------------------------
      lnwage |  .222327
       exper |  1.01525  104.011
      tenure |  .693356  44.8687   64.829
        educ |  .338334 -2.88977 -1.34832  5.20229
      female | -.020663 -.375623 -.251828 -.083016  .249132
         age |  1.54067  84.3489  53.0935  .107939  .253623  125.962
      single | -.045191 -1.51045 -1.00048  .021702 -.008122 -2.77303  .229015
     married |  .033517  .992395  .849162  .073523  -.01945   1.8242 -.178636   .25016
       kids6 |  .015775 -.812525 -.438087   .15314 -.004787 -.966309 -.057163  .076211  .285674
     kids714 |  .020063 -.213007 -.197544  .013243 -.000118  .267674  -.08793    .1053 -.006411  .501626

(Excerpt from the Swiss Labor Market Survey 1998)

Mean estimation                   Number of obs   =      1,458

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
      lnwage |   3.362503   .0134839      3.336053    3.388953
       exper |   13.19293   .2599656      12.68298    13.70288
      tenure |   7.572988   .1992737      7.182094    7.963882
        educ |   11.51749   .0640525      11.39184    11.64313
      female |   .4718793   .0130783      .4462249    .4975337
         age |   38.69342   .2925405      38.11957    39.26726
      single |   .3909465   .0127837      .3658701    .4160229
     married |   .4814815   .0130901      .4558041    .5071589
       kids6 |   .2112483   .0149838      .1818562    .2406404
     kids714 |    .297668   .0177171      .2629142    .3324218
--------------------------------------------------------------
(obs=1,458)

             |   lnwage    exper   tenure     educ   female      age   single  married    kids6  kids714
-------------+------------------------------------------------------------------------------------------
      lnwage |  .265087
       exper |  .986651  98.5347
      tenure |  .758004  40.7152  57.8972
        educ |  .346533 -4.37921 -1.28108  5.98176
      female | -.039653 -.398898  -.37838 -.127854   .24938
         age |  1.66528  80.5948  46.2706   .23529   .24429  124.776
      single |  -.05073 -1.86626 -.965353 -.035669  .010315 -2.92399  .238271
     married |  .035125   1.2831  .562809  .104476 -.024886  1.86151 -.188363  .249828
       kids6 |  .012868 -.594919 -.426548  .155534 -.031804  -1.0601  -.06823  .085589  .327341
     kids714 |  .013946  .323909  -.05909  .070459 -.002605  .382332  -.08488  .109154 -.006645  .457661
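Comparing covariance matrices by eye is tedious, so a single summary number per synthetic dataset may help. The sketch below uses Stata's mreldif() matrix function, which returns the largest relative difference between two matrices. It is one possible yardstick, not the comparison used in this post, and it again assumes the fake_oaxaca_`i'.dta files exist in the working directory. The frame name covcheck and the matrix names C0 and C1 are hypothetical.

```stata
* Sketch: one-number comparison of covariance matrices
local vars lnwage exper tenure educ female age single married kids6 kids714
frame create covcheck
frame covcheck {
    use http://fmwww.bc.edu/RePEc/bocode/o/oaxaca.dta, clear
    quietly correlate `vars', covariance
    matrix C0 = r(C)                   // covariance matrix, original data
    forvalues i = 1/5 {
        use fake_oaxaca_`i', clear
        quietly correlate `vars', covariance
        matrix C1 = r(C)               // covariance matrix, synthetic data
        display "Synthetic dataset `i': max relative difference = " %6.3f mreldif(C1, C0)
    }
}
frame drop covcheck
```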

Conclusions

As you can see, the results are far from a perfect replication of those from the original dataset. After all, we are introducing random errors to create a synthetic dataset, so other people can try to replicate our work.

With those caveats in mind, what we may end up doing is to create synthetic "fake" data like this, and report two versions of the results: one based on the actual data, and another based on the synthetic dataset(s).

If you are interested in the code I used, you can get it (with Rep Files here)

References

Jann, Ben (2008). The Blinder-Oaxaca decomposition for linear regression models. The Stata Journal 8(4): 453-479.

Jenkins, SP, Rios-Avila, F, 2021. "Measurement error in earnings data: replication of Meijer, Rohwedder, and Wansbeek's mixture model approach to combining survey and register data." J Appl Econ. 2021. Accepted Author Manuscript. https://doi.org/10.1002/jae.2811 (with Rep Files here)

Jenkins, Stephen P. & Rios-Avila, F, 2020. "Modelling errors in survey and administrative data on employment earnings: Sensitivity to the fraction assumed to have error-free earnings," Economics Letters, Elsevier, vol. 192(C).